{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M1.01"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Imagine we are interested in predicting penguins species based on two of their\n",
    "body measurements: culmen length and culmen depth. First we want to do some\n",
    "data exploration to get a feel for the data.\n",
    "\n",
    "What are the features? What is the target?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "The features are `\"culmen length\"` and `\"culmen depth\"`. The target is the\n",
    "penguin species."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data is located in `../datasets/penguins_classification.csv`, load it with\n",
    "`pandas` into a `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show a few samples of the data.\n",
    "\n",
    "How many features are numerical? How many features are categorical?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "Both features, `\"culmen length\"` and `\"culmen depth\"` are numerical. There are\n",
    "no categorical features in this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "penguins.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What are the different penguins species available in the dataset and how many\n",
    "samples of each species are there? Hint: select the right column and use the\n",
    "[`value_counts`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)\n",
    "method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "penguins[\"Species\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plot histograms for the numerical features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "_ = penguins.hist(figsize=(8, 4))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show features distribution for each class. Hint: use\n",
    "[`seaborn.pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import seaborn\n",
    "\n",
    "pairplot_figure = seaborn.pairplot(penguins, hue=\"Species\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We observe that the labels on the axis are overlapping. Even if it is not the\n",
    "priority of this notebook, one can tweak them by increasing the height of each\n",
    "subfigure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "pairplot_figure = seaborn.pairplot(penguins, hue=\"Species\", height=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at these distributions, how hard do you think it would be to classify\n",
    "the penguins only using `\"culmen depth\"` and `\"culmen length\"`?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "Looking at the previous scatter-plot showing `\"culmen length\"` and `\"culmen\n",
    "depth\"`, the species are reasonably well separated:\n",
    "- low culmen length -> Adelie\n",
    "- low culmen depth -> Gentoo\n",
    "- high culmen depth and high culmen length -> Chinstrap\n",
    "\n",
    "There is some small overlap between the species, so we can expect a\n",
    "statistical model to perform well on this dataset but not perfectly."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}